<?php
/**
 * <https://y.st./>
 * Copyright © 2019 Alex Yst <mailto:copyright@y.st>
 * 
 * This program is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 * 
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
 * GNU General Public License for more details.
 * 
 * You should have received a copy of the GNU General Public License
 * along with this program. If not, see <https://www.gnu.org./licenses/>.
**/

$xhtml = array(
	'<{title}>' => 'Limitations of linear regression',
	'takedown' => '2017-11-01',
	'<{body}>' => <<<END
<img src="/img/CC_BY-SA_4.0/y.st./weblog/2019/02/16.jpg" alt="The bike path crosses the bus lane" class="framed-centred-image" width="649" height="480"/>
<section id="drudgery">
	<h2>Drudgery</h2>
	<p>
		My discussion post for the day:
	</p>
	<blockquote>
		<p>
			One limitation I noticed when reading the textbook today is that linear regression necessarily omits several key variables.
			For example, the book had an example with advertising budgets.
			In this hypothetical case, the sample consists of data points representing various markets in which sales are already being attempted.
			What was their advertising budget?
			What were their sales?
			These are the values we look at and try to fit to a linear model.
			But what are we forgetting?
			People in different markets have different average income levels, different ideals, different cultural biases, different levels of gullibility (how likely ads are to affect them at all), et cetera.
			There is a lot more at play than just the numbers you look at when you build your model.
			The book lumps these unknown variables into an error term, calling it <var>ε</var>.
			<var>ε</var> is essentially random, as far as the model goes, because it represents everything we&apos;re specifically not accounting for.
			In terms of cost versus benefit, linear regression is a pretty good tool for prediction, and using it shouldn&apos;t cost you much.
			However, the simplicity of the linear regression model is still one of its biggest limitations, in my book.
			Even if you were to track sales and advertising budget figures for a single market over a period of time and use the same advertising budget every time, you&apos;re not going to get the same sales figures every time (James, Witten, Hastie, &amp; Tibshirani, 2013).
		</p>
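		<p>
			To make the role of <var>ε</var> concrete, here&apos;s a minimal Python sketch (all of the numbers are invented purely for illustration): the same advertising budget is used twice, yet the simulated sales differ, because <var>ε</var> differs.
		</p>
		<pre><code>import numpy as np

# Hypothetical model: sales = beta0 + beta1 * budget + epsilon,
# with made-up coefficients chosen purely for illustration.
beta0, beta1 = 7.0, 0.05
rng = np.random.default_rng(0)

budget = 100.0  # the same advertising budget, used twice
for run in (1, 2):
    epsilon = rng.normal(0.0, 2.0)  # everything the model ignores
    sales = beta0 + beta1 * budget + epsilon
    print(f"run {run}: sales = {sales:.2f}")
</code></pre>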
		<p>
			The book also mentions that when we use linear regression, we&apos;re assuming the two variables we&apos;re looking at have at least a somewhat linear relationship in the first place.
			The two may not be related, or there may be some other relationship, such as an exponential one.
			Linear regression simply can&apos;t account for non-linear relationships.
			The R<sup>2</sup> statistic at least allows us to measure how well regression worked for a given data set, though it doesn&apos;t allow us to fix the model in case of a bad fit.
			In multiple linear regression, the F-statistic can be used to determine if the data can be properly modelled using linear regression, but again, if it can&apos;t, it can&apos;t.
			We can partly account for some forms of non-linear relationships by adding some quadratic terms to our model, but this isn&apos;t going to work all the time.
			A linear model often works if you think like an engineer: it can be made to fit &quot;well enough&quot;.
			When you try to model a relationship that isn&apos;t quite linear using a linear equation though, you run into an obvious problem: your predictions are only valid within the range in which your approximation line is actually near reality.
			If the true relationship isn&apos;t linear, reality will move toward your estimation line as you approach that range, then away from it again as you leave (James, Witten, Hastie, &amp; Tibshirani, 2013).
		</p>
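		<p>
			As a rough sketch of the quadratic-term idea (again with invented data), fitting a degree-two polynomial to curved data leaves much smaller residuals than a straight line does:
		</p>
		<pre><code>import numpy as np

# Invented data with a curved (quadratic) relationship plus noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x + 0.3 * x**2 + rng.normal(0.0, 1.0, x.size)

# Fit a straight line, then a quadratic; compare the residual spread.
for degree in (1, 2):
    coefficients = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coefficients, x)
    print(f"degree {degree}: residual std = {residuals.std():.3f}")
</code></pre>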
		<p>
			There&apos;s also always an issue with estimating variables in an equation based on observations.
			We can get rather close to reality with our models, but we can never find the absolute correct variable values to use.
			Our estimate takes the form of a least squares line, though the least squares line we get depends on which data points happen to be in our sample.
			Compare that to the population regression line, as discussed by the book.
			We can never know the true population regression line, which represents absolute reality.
			This natural limitation comes from the field of statistics, and applies to anything we try to estimate using samples, not just cases of linear regression (James, Witten, Hastie, &amp; Tibshirani, 2013).
		</p>
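		<p>
			This sampling variation is easy to demonstrate: in the sketch below (invented population, invented noise), three different samples drawn from the same &quot;population&quot; line produce three different least squares estimates.
		</p>
		<pre><code>import numpy as np

# The population line is fixed, but each sample drawn from it
# yields a slightly different least squares line.
true_intercept, true_slope = 3.0, 2.0
rng = np.random.default_rng(2)

for sample in range(3):
    x = rng.uniform(0.0, 10.0, 30)
    y = true_intercept + true_slope * x + rng.normal(0.0, 2.0, 30)
    slope, intercept = np.polyfit(x, y, 1)
    print(f"sample {sample}: slope = {slope:.3f}, intercept = {intercept:.3f}")
</code></pre>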
		<p>
			Another limitation of regression is that it becomes less accurate when we have fewer data points.
			With more data points, we can reduce the standard error and narrow our confidence intervals, but below a certain threshold, linear regression can&apos;t be trusted to estimate things very accurately at all (James, Witten, Hastie, &amp; Tibshirani, 2013).
		</p>
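		<p>
			A quick simulation (invented data again) shows the effect: the spread of the slope estimates across repeated samples, which is what the standard error measures, shrinks as the sample size grows.
		</p>
		<pre><code>import numpy as np

# The spread of slope estimates across repeated samples shrinks
# as each sample gets larger.
rng = np.random.default_rng(3)
for n in (10, 100, 1000):
    slopes = []
    for trial in range(500):
        x = rng.uniform(0.0, 10.0, n)
        y = 1.0 + 2.0 * x + rng.normal(0.0, 2.0, n)
        slopes.append(np.polyfit(x, y, 1)[0])
    print(f"n = {n}: spread of slope estimates = {np.std(slopes):.4f}")
</code></pre>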
		<p>
			Another issue we might have is that our error terms may not be as random as we think they are.
			If they instead display some sort of correlation, our model will have a tendency to misrepresent reality.
			It can also give us false confidence in our model, due to an underestimated standard error.
			Likewise, it can cause us to fail to realise that certain predictors aren&apos;t statistically significant, and to give them undue weight (James, Witten, Hastie, &amp; Tibshirani, 2013).
		</p>
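		<p>
			One simple way to check for this (a sketch of my own, not the book&apos;s method) is to compute the lag-one autocorrelation of the residuals: values near zero suggest uncorrelated errors, while values near one suggest correlated ones.
		</p>
		<pre><code>import numpy as np

def lag1_autocorrelation(residuals):
    # Correlation between each residual and the one before it.
    r = residuals - residuals.mean()
    return float(np.dot(r[:-1], r[1:]) / np.dot(r, r))

rng = np.random.default_rng(4)
independent = rng.normal(0.0, 1.0, 200)
correlated = np.cumsum(independent)  # each error drags the next along
print(f"independent errors: {lag1_autocorrelation(independent):.3f}")
print(f"correlated errors:  {lag1_autocorrelation(correlated):.3f}")
</code></pre>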
		<p>
			Linear regression assumes a constant variance in the error terms.
			As we saw in a couple of charts in the textbook, the variance may differ between different <var>x</var> values.
			For example, in figure 3.1, there&apos;s much less variance when the television advertising budget is low.
			However, when it&apos;s high, there&apos;s much more variance.
			That leaves us unable to properly predict how accurate our linear model is (James, Witten, Hastie, &amp; Tibshirani, 2013).
		</p>
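		<p>
			One common workaround the book mentions is transforming the response, for example with a logarithm. In this invented example, the raw residuals spread out as <var>x</var> grows, but the residuals of the log-transformed response do not:
		</p>
		<pre><code>import numpy as np

# Invented data whose noise grows with x; taking the log of the
# response roughly evens the variance back out.
rng = np.random.default_rng(5)
x = np.linspace(1.0, 10.0, 200)
y = np.exp(0.2 + 0.3 * x + rng.normal(0.0, 0.2, x.size))

for label, response in (("raw", y), ("log", np.log(y))):
    residuals = response - np.polyval(np.polyfit(x, response, 1), x)
    low = residuals[: x.size // 2].std()
    high = residuals[x.size // 2 :].std()
    print(f"{label}: residual std, low x = {low:.3f}, high x = {high:.3f}")
</code></pre>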
		<p>
			Outliers can also badly skew our model.
			Oftentimes, it&apos;s not the slope or the intercept that gets skewed much, but other things, such as confidence intervals.
			&quot;High leverage&quot; data points can have a similar effect to outliers.
			These are data points whose <var>x</var> value is very different from the <var>x</var> values of the other data points observed.
			However, instead of distorting things such as confidence intervals, a high leverage data point is more likely to distort the slope and intercept (James, Witten, Hastie, &amp; Tibshirani, 2013).
		</p>
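		<p>
			Leverage can actually be computed directly: for simple regression, the book gives a leverage statistic based on how far a point&apos;s <var>x</var> value sits from the mean. A rough sketch with invented numbers, where the last point clearly stands out:
		</p>
		<pre><code>import numpy as np

# Leverage of each point in simple linear regression:
# h = 1/n + (x - mean(x))**2 / sum((x - mean(x))**2)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])  # the last point is far out
centred = x - x.mean()
leverage = 1.0 / x.size + centred**2 / np.sum(centred**2)
for xi, h in zip(x, leverage):
    print(f"x = {xi:5.1f}  leverage = {h:.3f}")
</code></pre>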
		<p>
			Finally, a major limitation of linear regression is that it doesn&apos;t handle multiple predictor variables very well if those variables correlate strongly with one another.
			Such predictors are said to be collinear.
			When two or more variables are collinear, it becomes exceedingly difficult to separate the effects of one of them on the output values from the effects of the other(s).
			Because of this, a wide range of plausible coefficient estimates is available for each of the collinear predictor variables.
			The accuracy of the coefficient estimates takes a major blow, which results in high standard errors for our model.
			It also makes it more difficult to detect irrelevant predictors, as credit for results may be misattributed to one of the collinear predictor variables when it should instead be attributed to another.
			Collinearity between a single pair of predictors is usually easy enough to spot if you&apos;re looking for it, but so-called multicollinearity (correlation between several predictors taken together, even when no individual pair is strongly correlated) is much more difficult to spot and deal with.
			In either case, the only way to fix the model is to eliminate the collinearity, either by omitting all but one predictor from each set of collinear predictor variables, or by deriving a single combined value from the collinear predictors and using that in the model instead of any of them (James, Witten, Hastie, &amp; Tibshirani, 2013).
		</p>
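		<p>
			A standard way to quantify this (the variance inflation factor, which the book also covers) is to regress each predictor on all of the others and see how much of it they already explain. A sketch with invented data, where one predictor is nearly a copy of another:
		</p>
		<pre><code>import numpy as np

def vif(features, j):
    # Variance inflation factor: regress column j on the other
    # columns and convert the resulting R-squared into 1/(1 - R2).
    others = np.delete(features, j, axis=1)
    design = np.column_stack([np.ones(features.shape[0]), others])
    target = features[:, j]
    coefficients, *_ = np.linalg.lstsq(design, target, rcond=None)
    fitted = design @ coefficients
    r2 = 1.0 - np.sum((target - fitted)**2) / np.sum((target - target.mean())**2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(6)
a = rng.normal(0.0, 1.0, 100)
b = 0.9 * a + rng.normal(0.0, 0.1, 100)  # nearly a copy of a: collinear
c = rng.normal(0.0, 1.0, 100)            # independent of both
features = np.column_stack([a, b, c])
for j, name in enumerate("abc"):
    print(f"VIF({name}) = {vif(features, j):.1f}")
</code></pre>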
		<p>
			Like any model, those produced using linear regression have their limits.
			We need to decide whether these models are fit for what we&apos;re trying to estimate, and if they are, work around these limitations in various ways to lessen the impact of said limitations on our ability to predict things accurately.
		</p>
		<div class="APA_references">
			<h3>References:</h3>
			<p>
				James, G., Witten, D., Hastie, T., &amp; Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. Retrieved from <a href="https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf"><code>https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf</code></a>
			</p>
		</div>
	</blockquote>
</section>
END
);
